This is the University of Edinburgh PPLS Writing Centre workshop The aRt of the figure: Visualising your data using R. Feel free to browse the script beforehand if you’d like, but also don’t be intimidated by the amount of code: you will be guided through all the exercises in the workshop (except the bonus sections, which are meant as additional material to be explored on your own once you’ve learned the basics).
What you see below are the results of a “completed” script file. Once you do all the little exercises and complete all the bonus sections - and knit the R Markdown script into an HTML web page - this is more or less what you would see.
Just clicking on the file should work if RStudio is installed; if it just opens in the browser as a plain text file, instead of downloading, try right-clicking (CTRL+click on a Mac) on the link and selecting Save link as...).
If for some reason it get saved as a .txt file instead - apparently can happen on some Macs: make sure the name of the file is artofthefigure.Rmd and not artofthefigure.Rmd.txt - rename it to the former if necessary.
Also, a tip: if you mess up the plotting window somehow (e.g. by changing global options), use the dev.off() command to reset the graphics engine.
Finally, this file contains a lot of stuff that we will not cover in this short workshop due to time constraints. These sections are marked as “bonus” - you are more than welcome to explore and experiment with the commands in there on your own time to create nicer plots, better plots, and cool stuff like maps and networks. If you need help with presenting stats and figures in the future, feel free to book a session with me via the Writing Centre.
col for color) usually all have the same names.plot( c(1,4,10) ) # the basic plot command
# note: the c() command *c*ombines values into a vector; this will be useful later
We will be using some datasets built into R, as well as randomly generated data. It is not difficult to import your own data into R, but we’ll save time by not doing that. Importing data is covered in the last “bonus” section below if you’re interested.
runif(n = 1) # this command generates n random numbers
## [1] 0.1591806
sample(x = c(1,4,10), size = 1) # this produces a number of samples from the provided vector (x)
## [1] 4
Numerical values include things we can measure on a continuous scale (height, weight, reaction time), things that can be ordered (“rate this on a scale of 1-5”), and things that have been counted (number of participants in an experiment, number of words in a text).
We will use a built-in dataset called “iris” - it contains information about a bunch of flowers.
data("iris") # load the data into the workspace (or "global environment").
# In RStudio, you can have a look at the dataframe by clicking on the little "table" icon next to it in the Environment section (top right).
# We can also inspect the data using R commands.
head(iris) # prints the first rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris) # produces an automatic summary of the columns
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Plotting time! Let's see for example how long the petals are in the dataset
iris$Petal.Length # the $ is used for accessing (named) column of a dataframe
## [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
## [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
## [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
## [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
## [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
## [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
iris[, "Petal.Length"] # this is the other indexing notation
## [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
## [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
## [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
## [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
## [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
## [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
plot(iris$Petal.Length) # two observations: there is quite a bit of variation, and it seems there are clusters in the data
hist(iris$Petal.Length, breaks=25) # a histogram shows the distribution of values (breaks change resolution)
boxplot(iris$Petal.Length) # a boxplot is like a visual summary()
points(x=rep(1, nrow(iris)), y=iris$Petal.Length) # could also add actual datapoints
# Here's something to try: the default color of the points is black. Change it to something else by adding the parameter col to the points command (remember, parameters are separated by commas, and they are given values using the = sign; color names must be in quotes, e.g., "red").
boxplot(iris$Petal.Length)
points(x=rep(1, nrow(iris)), y=iris$Petal.Length, col="red")
# Plot boxplots, grouped by the Species variable:
boxplot(iris$Petal.Length ~ iris$Species) # note the ~ notation
grid() # why not add a grid for reference
# A slightly nicer version:
boxplot(iris$Petal.Length ~ iris$Species, ylab="petal length", border=c("purple", "darkblue", "lightblue"), boxwex=0.5, cex=0.4)
abline(h=1:7, col=rgb(0,0,0,0.1)) # adds vertical lines instead of full grid
A Note on the rgb(red, green, blue, alpha) function: this allows making custom colors; alpha controls transparency. Possible values range between 0 and 1 by default. Below is a piece of code that generates an example of how the color scheme works (don’t worry if you don’t understand the actual code, this is above the level of this workshop; just put the cursor in the code block and press CTRL+SHIFT+ENTER (CMD+SHIFT+ENTER on Mac).
This short workshop is not intended to cover statistical concepts like correlation, but we will nevertheless quickly look at plotting two variables against each other.
plot(iris$Sepal.Length, iris$Sepal.Width) # no interaction?
# Why not color-code by species. This method also makes use of the [ ] indexing.
iriscolors= c("purple", "darkblue", "lightblue")
plot(iris$Sepal.Length, iris$Sepal.Width,
col=iriscolors[iris$Species], pch=20) # pch sets the point type
legend("topleft", pch=20, legend = levels(iris$Species), col=iriscolors, cex=0.7, bty="n") # add legend
# Whoa... This is suddenly a lot of code out of nowhere. If some of it looks overwhelming, worry not, everything will become clear once you get into R a bit more.
# Add detail to make this legible to the colour-blind, and printable in black-and-white
irispoints = c(15,16,17) # see help(points) for more
plot(iris$Sepal.Length, iris$Sepal.Width,
col = iriscolors[iris$Species] , pch = irispoints[iris$Species])
legend("topleft", pch=irispoints, legend = levels(iris$Species), col=iriscolors, cex=0.7, bty="n")
# Make this publication-ready by adding proper labels
plot(iris$Sepal.Length, iris$Sepal.Width,
col=iriscolors[iris$Species] , pch=irispoints[iris$Species],
main="Iris sepals in three species", xlab="Sepal length", ylab="Sepal width")
grid(col=rgb(0, 0, 0, 0.2)) # 0,0,0 results in black, 0.2 alpha makes it transparent (so, grey on a white background)
legend("topleft", pch=irispoints, legend = levels(iris$Species), col=iriscolors, cex=0.7, bty="n")
## Bonus: plotting regression lines
# This workshop does not cover actual statistical techniques, but in case you ever need to plot a regression line* - this is pretty simple in R:
# another iris plot
plot(iris$Sepal.Length, iris$Petal.Length, pch=irispoints[iris$Species])
grid(col=rgb(0,0,0,0.2), lty=1)
# do the regression analysis:
linmodel = lm(iris$Petal.Length ~ iris$Sepal.Length)
abline(linmodel, col=rgb(0,0,0,0.3), lwd=4) # abline can handle the output of the lm (linear model) command
# *Of course we already recognized that the data is more complex as initially thought (consisting of three distinct groups), so a proper regression model should take that into account.
While a whole subject on its own, we will have a quick look at plotting time series - data reflecting changes in some variable over time.
# This time we'll generate some random data and pretend it's real data.
# "The following data are reaction times to stimuli of one individual, over 100 trials, in an experiment on..."
retime = c(runif(20, 0,0.1), seq(0.1, 2.8,length.out = 80)) * runif(100, 0.7, 1.1 )
# Have a look at the raw data first
head(retime)
## [1] 0.006145227 0.059021327 0.061781357 0.092607855 0.061941686 0.031213055
# Now let's plot it
plot(retime, ylab="reaction time" )
plot(retime, ylab="reaction time", type="l" )
What can you tell by the looks of the data?
Exercise: improve this plot by adding the type parameter and setting its value as “l” (which stands for line, which are more useful in this instance), and set the X axis label (xlab) to say “trials” instead of the default “Index”.
## Bonus: plotting multiple (although still artificial) time series.
# First we need to create more data; let's bind it into a matrix as well
retimes = rbind(subj1=retime,
subj2=sort(rnorm(100, 1,0.3))*runif(100,0.4,0.8),
subj3=sort(rnorm(100, 1,0.3))*runif(100,0.9,1.1) )
# rows are subjects, columns are trials
# create an empty plot beforehand:
plot(NA, ylim=c(0,3), xlim=c(0,100), ylab="reaction time", xlab="trials"); grid()
# ...then add a line per subject; [1,] indexes the first row, etc:
lines(retimes[1, ], col="darkred")
lines(retimes[2, ], col="darkblue")
lines(retimes[3, ], col="darkgreen")
Categorical/nominal/discrete values cannot be put on a continuous scale or ordered, and include things like binary values (student vs non-student) and all sorts of labels (noun, verb, adjective). Words in a text could be viewed as categorical data.
Here is another artificial dataset. Let’s pretend I went around Edinburgh and asked random people on the street the following question: “A new species of insect was recently discovered in Scotland, and they called it Boubicus Boubasus - or Bouba for short. What’s your intuition, is Bouba a big fat bug, or a small slim bug?” (and the same for Kikis Kikosius, or Kiki for short)
boubakiki = data.frame(
meanings=c(
sample(c("big", "small"),25,T, prob=c(0.8,0.2)),
sample(c("big", "small"),24,T, prob=c(0.3,0.7))
),
words=c(rep("bouba",25), rep("kiki",24))
) # this command will create the random data
# Have a look at the raw data first.
# In addition to eyeballing the data, use the following commands: nrow(), dim()
nrow(boubakiki) # number of rows
## [1] 49
dim(boubakiki) # dimensions: number of rows and columns
## [1] 49 2
# Now let's use the table() function to make sense of it:
bktable = table(boubakiki)
mosaicplot(bktable, col=c("orange", "navy")) # a simple mosaic plot, displays proportions
barplot(bktable, ylab="big small") # a barplot, displays counts
## Bonus:
# You *could* also use heatmap(bktable), but a heatmap does not make much sense with just 2x2 categories - it's easier to just look at the actual table. To get an idea how it would look like with larger contingency tables (or things like similarity matrices), check out this plot (of some more random data):
heatmap(matrix(runif(20^2), 20,20, dimnames=list(letters[1:20], letters[1:20])), symm=T)
## Bonus 2:
# If you ever find yourself dealing with a correlation matrix (a matrix of correlation values between a bunch of variables; usually via the cor() function in R), then you might be interested in the corrplot package. It is similar to heatmaps, but generates plots with correlation matrices in mind. Do an install.package("corrplot") and try the following
library("corrplot")
## corrplot 0.84 loaded
corrplot(cor(iris[,1:4]) )
install.packages("wordcloud") # install a "package", a collection of functions to extend R's basic functionality; this needs to be done only once for each package.
# Let's create an object with a bunch of text:
sometext = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort. It had a perfectly round door like a porthole, painted green, with a shiny yellow brass knob in the exact middle. The door opened on to a tube-shaped hall like a tunnel: a very comfortable tunnel without smoke, with panelled walls, and floors tiled and carpeted, provided with polished chairs, and lots and lots of pegs for hats and coats—the hobbit was fond of visitors. The tunnel wound on and on, going fairly but not quite straight into the side of the hill — The Hill, as all the people for many miles round called it — and many little round doors opened out of it, first on one side and then on another. No going upstairs for the hobbit: bedrooms, bathrooms, cellars, pantries (lots of these), wardrobes (he had whole rooms devoted to clothes), kitchens, dining-rooms, all were on the same floor, and indeed on the same passage. The best rooms were all on the left-hand side (going in), for these were the only ones to have windows, deep-set round windows looking over his garden, and meadows beyond, sloping down to the river."
# Now let's do some very basic preprocessing to be able to work with the words in the text:
clean = gsub("[[:punct:]]", "", sometext) # remove punctuation (that weird thing inside the gsub (R's find-and-replace command) is a regular expression; don't ask, it just works)
cleanlow = tolower(clean) # make everything lowecase
words = strsplit(cleanlow, split=" ")[[1]]
# Inspect the object we just created. It should be a vector of 232 words.
# Some ways to inspect and visualize textual data
sortedwords = sort(table(words)) # counts the words and sorts them
plot(sortedwords, xaxt="n")
axis(1, 1:length(sortedwords), names(sortedwords), las=2, cex.axis=0.5) # add the words
# Time to use the wordcloud package we installed earlier
library("wordcloud") # load the package (needs to be done again when you start R again)
## Loading required package: RColorBrewer
wordfreqs = as.numeric(sortedwords) # get the frequencies from the table object
wordcloud(words = names(sortedwords), freq=wordfreqs, min.freq = 0) # needs the wordcloud package to work
# Note: if R gives you errors (saying word x could not fit), ignore them. Also, if plots look strange after using wordcloud, use the dev.off() command to reset graphics.
## Bonus: further cleaning for nicer clouds
# Ideally we would remove stopwords (the, and...) before plotting things like wordclouds. There are packages that do that (tm, text2vec) in various ways, or you could write some code to do it yourself (using clever math to give greater weight to "context" words, removing stopwords using lists or regular expressions, etc).
# A very simple approach is to just remove all short words:
sortedwords2 = sortedwords[which(nchar(names(sortedwords))>4 ) ] # look up the help files of the commands used here if you'd like to understand how this works exactly
wordcloud(names(sortedwords2), as.numeric(sortedwords2), min.freq = 0, col=terrain.colors(10))
## Warning in wordcloud(names(sortedwords2), as.numeric(sortedwords2),
## min.freq = 0, : going could not be fit on page. It will not be plotted.
# terrains.colors() is a "palette", a function that generates a number of colors from a predefined set
That’s it for today!
Do play around with these things later when you have time, and look into the bonus sections for more cool stuff. If you get stuck, Google is your friend; also, check out www.stackoverflow.com - this site is a goldmine of programming (including R) questions and solutions. If you get into making more plots, you might also want to look into the ggplot2 package, which offers an alternative way of making plots, which some people find more intuitive than the basic R way.
Also, if you get around analysing your own data and need help in terms of writing about the results of your analyses and presenting your data, choosing the right graph type, what to plot, and in general how to plot the things you want to plot - feel free to book a session with me through the PPLS Writing Centre: go to http://writingcentre.ppls.ed.ac.uk/appointments/ -> Book an appointment -> look for my name (Andres Karjus) in the list of tutors, in the ‘Select staff’ dropdown menu.
But wait! There’s one more thing to do. Since this is an R Markdown document, we can “knit” it into a nice HTML (or PDF, or Word) report file - it will show both the code and the plots produced by the code. Note that unfortunately this will not work if you have errors in your code - marked by the little red x signs on the left side vertical bar. To knit, click the Knit button (with the little blue ball of yarn) above the script window. If the code is without errors, an HTML document will appear.
Here are some more things you can try out at home later.
Small note: if you try knitting the RMarkdown file again later and would like to see output from the bonus sections, set eval=TRUE in these blocks, which will allow them to be rendered (all bonus blocks currently have the eval parameter set as FALSE). Do not set eval to TRUE in the small blocks with the install.packages() commands though. You might have also noticed the echo=F parameter - this just means the code itself will not be rendered in the knit output (even when it is executed).
Making maps programmatically based on data would come in handy if your worked with demographic data, or dialects, areal sociolinguistics, etc.
install.packages("raster") # install another package
library("raster")
## Loading required package: sp
uk = getData("GADM", country = "United Kingdom", level = 2) # download UK map (needs the raster package to be loaded)
par(mar=c(0,0,0,0)) # change plot area for better visibility - mar defines the margins, which are usually used for the axes and labels (reset using dev.off() later)
plot(uk, lwd=0.4) # plot the UK
grid() # add grid
points(-3.1833, 55.9533, pch=20, col="blue", cex=2) # your current location :)
plot(uk, lwd=0.4, xlim=c(-4,-1), ylim=c(55.7,55.8)); points(-3.1833, 55.9533, pch=20, col="blue", cex=2) # zoom in
A quick look at a package that creates interactive (clickable, rotatable, etc) plots, both in 2D and 3D - these also work in web pages (like the html file you could create by knitting this script file; R Markdown can also be used to create slides, meaning you could easily include interactive graphs in your next presentation).
install.packages("plotly") # install the package
library(plotly, quietly = T)
# another look at the iris data (note this uses magrittr's pipe %>% notation)
plot_ly(x= iris$Petal.Length,y=iris$Sepal.Width, z=iris$Sepal.Length, type="scatter3d",mode="markers",color=iris$Species) %>% layout(scene = list(yaxis=list(title="Sepal width"),xaxis=list(title="Petal length"), zaxis=list(title="Sepal.Length")))
# Remember the rgb color plots with the black background from earlier? Given there are 3 colors, they really would make more sense in 3D, no?
red=runif(1000);green=runif(1000);blue=runif(1000)
plot_ly(x=red,y=green, z=blue, type="scatter3d",mode="markers", color=I(apply(cbind(red, green, blue,0.5),1, function(x) rgb(x[1],x[2], x[3])))) %>% layout(paper_bgcolor='black', scene = list(xaxis=list(title="red"),yaxis=list(title="green"), zaxis=list(title="blue")))
Once you get around to working with your own data, you’ll need to import it into R to be able to make plots based on it. There are a number of ways of doing that.
This is probably the most common use case. If your data is in an Excel file formal (.xls, .xlsx), you are better off saving it as a plain text file (although there are packages to import directly from these formats, as well as from SPSS .sav files). The commands for that are read.table(), read.csv() and read.delim(). They basically all do the same thing, but differ in their default settings.
# an example use case with parameters explained
mydata = read.table(file="path/to/my/file.txt", # full file path as a string
header=T, # if the first row contains column names
row.names=1, # if the 1st (or other) column is row names
sep="\t", # what character separates columns in the text file*
quote="", # if quotes in text in any columns, set this to ""
)
# * "\t" is for tab (default if you save a text file from Excel), "," for csv, " " if space-spearated, etc
# for more and to check the defaults, see help(read.table)
# the path can be just a file name, if the file is in the working (R's "default") directory; use getwd() to check where that is, and setwd(full/path/to/folder) to set it (or you can use RStudio's Files tab, click on More)
There is a simple way to import data from the clipboard. While importing from files is generally a better idea (you can always re-run the code and it will find the data itself), sometimes this is handy, like quickly grabbing a little piece of table from Excel. It differs between OSes:
mydata = read.table(file = "clipboard") # in Windows (add parameters as necessary)
mydata = read.table(file = pipe("pbpaste")) # on a Mac
For text, the readLines() command usually works well (its output is a character vector, so if the text file has 10 lines, then readLines produces a vector of length 10, where each line is an element in that vector (you could use strsplit() to further split it into words. If the text is organized neatly in columns, however, you might still consider read.table(), but probably with the stringsAsFactors=F parameter (this avoids making long text strings into factors, read up on it if needed).
RStudio has handy options to export plots - click on Export on top of the plot panel, and choose the output format. Plots can be exported using R code as well - this is in fact a better approach, since otherwise you would have to click through the Export… menus again every time you change your plot and need to re-export. Look into the help files of the jpeg and pdf functions to see how this works.
There are a number of ways for creating animated plots in R and making nice GIFs that you can use in a talk, on your website or wherever. There is the animation package, or on a Mac you can use ImageMagick’s Terminal commands to convert plot files into a GIF (you can send commands to Mac’s Terminal using the system() command; learn about loops to easily generate a number of plots with only a few lines of code).
# a simple loop that generates 12 plots
par(mfrow=c(4,3), mar=c(2,2,0,0)) # sets the plotting window to accommodate multiple plots and reduces margins
x = runif(12)*runif(12) # random numbers
for (i in 1:12){ # this is a for-loop
plot(x[1:i], xlab="", ylab="", type="o", xlim=c(1,12), ylim=c(0,1))
}
# save these as .png files instead (use png() ) and then convert the pngs into gif
There are also packages to import (and manipulate) images, GIS map data (shapefiles), data from all sorts of other file formats (like XML, HTML) and many more. Just google a bit and you’ll find what you need.